Classifying Brand Preference

Rhys Hewer

1 Executive Summary

1.1 Objective

Market research was conducted into customer brand preferences. Not all of the brand preference responses were correctly captured. The objective of this analysis is to use the demographic data of the correctly captured responses to predict the brand preferences of those incomplete responses.

1.2 Method

The correctly captured data was split into a training set (75%) and a testing set (25%). Three machine learning algorithms (K-Nearest Neighbour, Random Forest, Gradient Boosting Machine) were run on the training set to create a series of candidate models.

The most promising of these models, the Gradient Boosting Machine, was then run against the test set to gauge the model's accuracy. We were able to develop a model of approximately 92% accuracy.

1.3 Findings

The finalised model was deployed against the incomplete responses to predict the brand preference. These predicted responses were then combined with the existing responses to form a final data set.

The headline figures are that 62% of customers prefer Sony and 38% of customers prefer Acer.

We were also able to define customers who prefer Acer as follows:

  • Salary 25k - 75k, Age 60 - 80
  • Salary 75k - 125k, Age 40 - 60
  • Salary 50k - 100k, Age 20 - 40

1.4 Recommendations

  • Further research into the reasons for preferring one brand over another before committing to a deeper strategic relationship.

2 Initial Data Processing

#load libraries
library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
library(RColorBrewer)
library(corrplot)
library(caret)
library(tidyr)
library(kableExtra)
library(parallel)
library(doParallel)
library(rattle)
library(rpart)
library(gridExtra)
library(ROCR)
#read in data
setwd("C:/Users/rhysh/Google Drive/Data Science/Ubiqum/Project 2/Task 2")
incomplete <- read.csv("SurveyIncomplete.csv")
origdata <- read_xlsx("Survey_Key_and_Complete_Responses_excel.xlsx", sheet = 2)
data <- origdata

As we are working across two spreadsheets, it is important to check whether they are structured alike or whether manipulation will be required to ensure they have the same features.

#Check structure of data to see if alike
names(incomplete) == names(data)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

The features match between the spreadsheets so no data manipulation is needed in that respect.

incomplete %>% sample_n(5)
##         salary age elevel car zipcode    credit brand
## 1681  40088.94  72      0  11       2 144261.04     0
## 3975  76260.55  43      0   2       5  12807.45     0
## 3032  55525.27  49      0  10       0 310990.52     0
## 4272 114305.97  22      4  13       5 383973.00     0
## 3506  79312.86  68      3  12       6 115222.44     0
data %>% sample_n(5) 
## # A tibble: 5 x 7
##    salary   age elevel   car zipcode  credit brand
##     <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl> <dbl>
## 1  85467.    26      1    13       0      0      0
## 2 109015.    70      1    16       2 382216.     1
## 3 143473.    39      3    16       6 293273.     1
## 4  31702.    50      2     2       6 283791.     1
## 5 106030.    62      4     5       5 336324.     0

A quick look at a sample from both spreadsheets shows that they are very similar in composition.

str(incomplete)
## 'data.frame':    5000 obs. of  7 variables:
##  $ salary : num  110500 140894 119160 20000 93956 ...
##  $ age    : int  54 44 49 56 59 71 32 33 32 58 ...
##  $ elevel : int  3 4 2 0 1 2 1 4 1 2 ...
##  $ car    : int  15 20 1 9 15 7 17 17 19 8 ...
##  $ zipcode: int  4 7 3 1 1 2 1 0 2 4 ...
##  $ credit : num  354724 395015 122025 99630 458680 ...
##  $ brand  : int  0 0 0 0 0 0 0 0 0 0 ...
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    10000 obs. of  7 variables:
##  $ salary : num  119807 106880 78021 63690 50874 ...
##  $ age    : num  45 63 23 51 20 56 24 62 29 41 ...
##  $ elevel : num  0 1 0 3 3 3 4 3 4 1 ...
##  $ car    : num  14 11 15 6 14 14 8 3 17 5 ...
##  $ zipcode: num  4 6 2 5 4 3 5 0 0 4 ...
##  $ credit : num  442038 45007 48795 40889 352951 ...
##  $ brand  : num  0 1 0 1 0 1 1 1 0 1 ...

Looking at the structure shows that a few changes are needed to the data types.

Overall, however, the datasets are sufficiently similar. I will take the strategy of splitting into training, test and final prediction sets. The training and testing sets will come from the complete responses data; the final prediction set will come from the incomplete responses.

The exploratory data analysis will be performed on the complete data, but any data transformations applied to the training/testing sets will also need to be applied to the final prediction data prior to modelling.
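One low-risk way to keep the sets in sync is to wrap the transformations in a single helper applied to every data frame. A minimal sketch with toy data (the helper name and toy values are illustrative, not part of the original analysis):

```r
# Hypothetical helper: apply the same factor conversions to any survey data frame
prep_factors <- function(df, cols = c("elevel", "car", "zipcode", "brand")) {
  for (col in intersect(cols, names(df))) {
    df[[col]] <- as.factor(df[[col]])
  }
  df
}

# Toy stand-ins for the complete and incomplete responses
complete_toy   <- data.frame(salary = c(5e4, 9e4), elevel = c(0, 3), brand = c(0, 1))
incomplete_toy <- data.frame(salary = c(7e4, 3e4), elevel = c(2, 4), brand = c(0, 0))

complete_toy   <- prep_factors(complete_toy)
incomplete_toy <- prep_factors(incomplete_toy)
```

Because both data frames pass through the same function, any change to the transformations is automatically applied to both.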

#check for NAs
data %>% is.na() %>% sum()
## [1] 0
incomplete %>% is.na() %>% sum()
## [1] 0

There are no missing values in either data set.

#data types
data$elevel <- data$elevel %>% as.factor()
data$car <- data$car %>% as.factor()
data$zipcode <- data$zipcode %>% as.factor()
data$brand <- data$brand %>% as.factor()

Education level, car owned, zip code and brand preference are converted from numeric to factors.

##check for outliers
numericVars <- Filter(is.numeric, data)
numericVars %>% sapply(function(x) boxplot(x, plot = FALSE)$out) %>% str()
## List of 3
##  $ salary: num(0) 
##  $ age   : num(0) 
##  $ credit: num(0)

There are no outliers by the boxplot (1.5 × IQR) criterion.

3 Exploratory Data Analysis

3.1 Feature Histograms

3.1.1 Brand Preference Histogram

#EDA

##Key feature is brand preference, begin with exploring this value.

g6 <- ggplot(data, aes(brand, fill = brand)) +
        geom_bar() +
        theme_bw() +
        scale_fill_brewer(palette="Dark2") +
        xlab("Brand Preference") + 
        ylab("Frequency") + 
        ggtitle("Brand Preference Frequencies")
g6

  • We see a clear preference for brand 1 over brand 0.

3.1.2 Histogram of other variables

##review other histograms for skewness
histData <- origdata %>% select(-brand)
g8 <- ggplot(gather(histData), aes(value)) + 
        geom_histogram(bins = 20, fill = "#D95F02", colour = "white") + 
        theme_bw() +
        facet_wrap(~key, scales = 'free_x') +
        xlab("Value") + 
        ylab("Count") + 
        ggtitle("Histograms of Numeric Variables")
g8

  • Across the remaining features we see no extreme skewing or noteworthy patterns.

3.2 Brand Choice Plotting

3.2.1 Brand Choice v Salary

## Plot Brand choice v other variables
g1 <- ggplot(data, aes(brand, salary, fill = brand)) +
        geom_violin() +
        theme_bw() +
        scale_fill_brewer(palette="Dark2") +
        xlab("Brand Preference") + 
        ylab("Salary ($)") + 
        ggtitle("Brand Preference v Salary")
g1 <- ggplotly(g1)
g1
  • There is a clear pattern: salaries ranging between 45k - 100k seem to have a preference for brand 0. Salaries outside of this range seem to have a preference for brand 1.

3.2.2 Brand Choice v Age

g2 <- ggplot(data, aes(brand, age, fill = brand)) +
        geom_violin() +
        theme_bw() +
        scale_fill_brewer(palette="Dark2") +
        xlab("Brand Preference") + 
        ylab("Age") + 
        ggtitle("Brand Preference v Age")
g2 <- ggplotly(g2)
g2
  • There is no noteworthy pattern between brand preference and age.

3.2.3 Brand Choice v Education Level

g3 <- ggplot(data, aes(brand, elevel, colour = brand)) +
        geom_count() +
        theme_bw() +
        scale_color_brewer(palette="Dark2") +
        xlab("Brand Preference") + 
        ylab("Education Level") + 
        ggtitle("Brand Preference v Education Level")
g3

  • The general preference for brand 1 is shown in this plot. Preference for brand 1 appears consistent regardless of education level, whereas preference for brand 0 seems to vary more with education level.

3.2.4 Brand Choice v Car

g4 <- ggplot(data, aes(brand, car, colour = brand)) +
        geom_count() +
        theme_bw() +
        scale_color_brewer(palette="Dark2") +
        xlab("Brand Preference") + 
        ylab("Car") + 
        ggtitle("Brand Preference v Car")
g4

  • The general preference for brand 1 is shown in this plot. Preference for brand 1 appears consistent regardless of car owned, whereas preference for brand 0 seems to vary more with the car owned.

3.2.5 Brand Choice v Zip Code

g5 <- ggplot(data, aes(brand, zipcode, colour = brand)) +
        geom_count() +
        theme_bw() +
        scale_color_brewer(palette="Dark2") +
        xlab("Brand Preference") + 
        ylab("Zip Code") + 
        ggtitle("Brand Preference v Zip Code")
g5

  • The general preference for brand 1 is shown in this plot. Preference for brand 1 appears consistent regardless of zip code, whereas preference for brand 0 seems to vary more with zip code.

3.2.6 Brand Choice v Credit

g7 <- ggplot(data, aes(brand, credit, fill = brand)) +
        geom_violin() +
        theme_bw() +
        scale_fill_brewer(palette="Dark2") +
        xlab("Brand Preference") + 
        ylab("Credit") + 
        ggtitle("Brand Preference v Credit")
g7 <- ggplotly(g7)
g7
  • There is no noteworthy pattern between brand preference and credit.

3.3 Initial Hypothesis

There is a general preference for brand 1, and salary seems to be the key feature.

Age and credit seem to have limited influence on brand preference. Whilst there are patterns of preference within education level, car and zip code, salary appears to have the most striking impact.

Salaries ranging between 45k - 100k seem to have a preference for brand 0. Salaries outside of this range seem to have a preference for brand 1.

3.4 Feature Selection

3.4.1 Decision Tree Variable Importance

We can review the initial hypothesis and provide some information on which to make feature selection decisions using a decision tree.

featureDT <- rpart(brand ~ ., data = data)
featureDT$variable.importance
##        age     salary        car    zipcode     credit     elevel 
## 2036.06398 1421.30552   78.59978   42.70644   21.20404   16.06770
fancyRpartPlot(featureDT)

The decision tree is formed exclusively of age and salary, suggesting these are the key features. This is reinforced by the variable importance information from within the model.

3.4.2 Supplementary Exploratory Data Analysis

g12 <- ggplot(data, aes(salary, age, colour = brand)) +
        geom_point(show.legend = FALSE) +
        facet_grid(brand ~ .) +
        theme_bw() +
        scale_color_brewer(palette="Dark2") +
        xlab("Salary") + 
        ylab("Age") + 
        ggtitle("Salary v Age by Brand Preference")
g12

We see that the interaction of age and salary plays a role in brand preference. There are clearly defined blocks of preference depending on age and salary. For example, customers aged 60 - 80 earning between 25k - 80k have a distinct preference for brand 0.

I have updated the initial hypothesis to now consider that age and salary are the key features in determining brand preference.

3.4.3 Collinearity and Variance

##Review correlation matrix
corrMatrix <- numericVars %>% cor()
corrMatrix %>% corrplot.mixed()

Reviewing the correlation plot, we see very little correlation between the features. This means that collinearity is not an issue that needs to be addressed.

#near zero variance
nzv <- data %>% nearZeroVar(saveMetrics = TRUE)
nzv
##         freqRatio percentUnique zeroVar   nzv
## salary   1.112069         97.53   FALSE FALSE
## age      1.178744          0.61   FALSE FALSE
## elevel   1.035946          0.05   FALSE FALSE
## car      1.026415          0.20   FALSE FALSE
## zipcode  1.020105          0.09   FALSE FALSE
## credit   1.057377         97.51   FALSE FALSE
## brand    1.643405          0.02   FALSE FALSE

There are no features with near zero variance.

3.4.4 Feature Selection Conclusion

Tree models contain a degree of built-in feature selection. KNN can benefit from a reduced feature set and from centring and scaling of the features. As the KNN algorithm works by finding the ‘nearest’ neighbours, scaling and centring the features ensures that they all have the same scale, and therefore the same influence on the model. Leaving the features unscaled means that a feature with a large scale, like salary, could have a disproportionate influence on the model compared to a feature with a smaller scale, like age.
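The effect of scaling on nearest-neighbour distances can be illustrated with a small sketch (toy values, not the survey data): unscaled, the salary differences swamp the age differences, so salary alone decides which customer is "nearest"; after centring and scaling, both features contribute on a comparable footing.

```r
# Three toy customers: salary on a scale of tens of thousands, age on a scale of tens
customers <- data.frame(salary = c(50000, 50500, 60000),
                        age    = c(30, 70, 31))

# Unscaled Euclidean distances: dominated almost entirely by salary
dist(customers)

# Centred and scaled: both features now matter
dist(scale(customers))
```

Unscaled, customer 1 looks far closer to customer 2 (small salary gap, large age gap) than to customer 3; after scaling, those two distances become comparable.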

As such, I will run all models on two feature sets: the full feature set and a concentrated feature set (brand, salary, age). The concentrated features were chosen based on the decision tree variable importance and the supplementary data analysis, which showed a clear pattern between these features.

4 Modelling

4.1 Modelling Preparation

#Creating Testing/Training sets
set.seed(111)
trainIndex <- createDataPartition(data$brand, p = 0.75, list = FALSE)
training <- data[ trainIndex,]
testing  <- data[-trainIndex,]

concData <- data %>% select(brand, age, salary)
training.conc <- concData[ trainIndex,]

#set up parallel processing (requires parallel and doParallel libraries)
cluster <- makeCluster(detectCores() - 1) 
registerDoParallel(cluster)

#Cross Validation 10 fold
fitControl<- trainControl(method = "cv", number = 10, savePredictions = TRUE, allowParallel = TRUE)

Test and training sets are created with all features. An additional training set is created with just brand, age and salary. 10-fold cross-validation is being used and parallel processing is being leveraged.

4.2 Modelling K-Nearest Neighbour

4.2.1 Training KNN

4.2.1.1 KNN - All Features

# Deploy KNN model 
model.KNN.brand <- train(brand ~ ., data = training, 
                         method = "knn", trControl = fitControl)
model.KNN.brand
## k-Nearest Neighbors 
## 
## 7501 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6750, 6751, 6751, 6750, 6751, 6750, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.6776399  0.3113201
##   7  0.6872399  0.3322341
##   9  0.6968388  0.3535302
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

4.2.1.2 KNN - Concentrated Features

# Deploy KNN model with concentrated feature set
model.KNN.brand.conc <- train(brand ~ ., data = training.conc, 
                              method = "knn", trControl = fitControl)
model.KNN.brand.conc
## k-Nearest Neighbors 
## 
## 7501 samples
##    2 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6751, 6751, 6751, 6751, 6751, 6751, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.6943070  0.3477357
##   7  0.7003066  0.3596805
##   9  0.7083050  0.3756759
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

4.2.1.3 KNN - All Features, Scaled and Centred

# Deploy KNN model with scaled features
model.KNN.brand.scale <- train(brand ~ ., data = training, 
                         method = "knn", trControl = fitControl, 
                         preProcess = c("center", "scale"))
model.KNN.brand.scale
## k-Nearest Neighbors 
## 
## 7501 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6751, 6750, 6751, 6752, 6751, 6751, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa     
##   5  0.5681921  0.05636529
##   7  0.5713877  0.04898238
##   9  0.5771240  0.04558751
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

4.2.1.4 KNN - Concentrated Features, Scaled and Centred

# Deploy KNN model with scaled features and concentrated feature set
model.KNN.brand.conc.scale <- train(brand ~ ., data = training.conc, 
                              method = "knn", trControl = fitControl, 
                              preProcess = c("center", "scale"))
model.KNN.brand.conc.scale
## k-Nearest Neighbors 
## 
## 7501 samples
##    2 predictor
##    2 classes: '0', '1' 
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6750, 6751, 6751, 6751, 6751, 6750, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.9144069  0.8180983
##   7  0.9164072  0.8224060
##   9  0.9188077  0.8274779
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

4.2.1.5 KNN - best model tuned

KNN, scaled and centred with the concentrated feature set, was the most promising model. This model is tuned further to improve performance.

#tune best KNN model further
tGridKNN <- expand.grid(k = c(6,7,8,9,10))
finalKNN <- train(brand ~ ., data = training.conc,
                  method = "knn", trControl = fitControl,
                  preProcess = c("center", "scale"),
                  tuneGrid = tGridKNN)
finalKNN
## k-Nearest Neighbors 
## 
## 7501 samples
##    2 predictor
##    2 classes: '0', '1' 
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6751, 6751, 6751, 6751, 6750, 6750, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    6  0.9162791  0.8220349
##    7  0.9174782  0.8246540
##    8  0.9170787  0.8238580
##    9  0.9192106  0.8283923
##   10  0.9177456  0.8252175
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

By tuning the model I was able to increase the accuracy on the training set to 92% and the Kappa to 0.83.

4.2.2 Testing KNN

KNN (k = 9), scaled and centred with the concentrated feature set, was deployed on the test set.

#Predictions on the test set (model = model name, testing = test set)
predictions.KNN <- predict(finalKNN, testing)
testing$predictions.KNN <- predictions.KNN


#Confusion matrix
confMatrixKNN <- confusionMatrix(testing$brand, testing$predictions.KNN)
confMatrixKNN
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  839  106
##          1  111 1443
##                                           
##                Accuracy : 0.9132          
##                  95% CI : (0.9014, 0.9239)
##     No Information Rate : 0.6198          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8156          
##  Mcnemar's Test P-Value : 0.786           
##                                           
##             Sensitivity : 0.8832          
##             Specificity : 0.9316          
##          Pos Pred Value : 0.8878          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.3802          
##          Detection Rate : 0.3357          
##    Detection Prevalence : 0.3782          
##       Balanced Accuracy : 0.9074          
##                                           
##        'Positive' Class : 0               
## 
metrics.KNN <- postResample(pred = testing$predictions.KNN, obs = testing$brand)
metrics.KNN
##  Accuracy     Kappa 
## 0.9131653 0.8155566
fourfoldplot(confMatrixKNN$table, conf.level = 0, margin = 2, main = "Confusion Matrix KNN")

The Accuracy of the model on the test set was 91% with a kappa of 0.82. These seem to be satisfactory outcomes considering the clear delineation of the data.
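As a sanity check, the Accuracy and Kappa reported above can be recomputed by hand from the confusion matrix counts:

```r
# KNN confusion matrix from the test set (rows = Prediction, cols = Reference)
cm <- matrix(c(839, 111, 106, 1443), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n  <- sum(cm)
po <- sum(diag(cm)) / n                     # observed agreement = Accuracy
pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kap <- (po - pe) / (1 - pe)                 # Cohen's Kappa

round(po, 4)   # 0.9132
round(kap, 4)  # 0.8156
```

These match the `postResample` figures, confirming that Kappa is simply Accuracy corrected for the agreement we would expect by chance given the class proportions.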

4.3 Modelling Random Forest

4.3.1 Training Random Forest

4.3.1.1 Random Forest on full feature set

# Deploy RF model 
set.seed(111)
tGridRF <- expand.grid(mtry = c(16,18,20))
model.RF.brand.tune <- train(brand ~ ., data = training, 
                        method = "rf", trControl = fitControl,
                        tuneGrid = tGridRF)
model.RF.brand.tune
## Random Forest 
## 
## 7501 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6751, 6751, 6751, 6750, 6751, 6752, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   16    0.9249429  0.8405087
##   18    0.9252090  0.8411582
##   20    0.9237432  0.8379887
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.

4.3.1.2 Random Forest on concentrated feature set

## Deploy RF model on concentrated feature set
tGridRF.conc <- expand.grid(mtry = c(1,2,3))
model.RF.brand.conc <- train(brand ~ ., data = training.conc, 
                         method = "rf", trControl = fitControl,
                         tuneGrid = tGridRF.conc)
model.RF.brand.conc 
## Random Forest 
## 
## 7501 samples
##    2 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6751, 6752, 6751, 6750, 6751, 6751, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.9148129  0.8190576
##   2     0.9052142  0.7981574
##   3     0.9058807  0.7997563
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.

The Accuracy of the best Random Forest model was 93% with a Kappa of 0.84.

4.3.2 Testing Random Forest

Random Forest on the full feature set with mtry = 18 was deployed on the test set.

## Test Random Forest
predictions.RF <- predict(model.RF.brand.tune, testing)
testing$predictions.RF <- predictions.RF

#Confusion matrix
confMatrixRF <- confusionMatrix(testing$brand, testing$predictions.RF)
confMatrixRF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  838  107
##          1  107 1447
##                                          
##                Accuracy : 0.9144         
##                  95% CI : (0.9027, 0.925)
##     No Information Rate : 0.6218         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8179         
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.8868         
##             Specificity : 0.9311         
##          Pos Pred Value : 0.8868         
##          Neg Pred Value : 0.9311         
##              Prevalence : 0.3782         
##          Detection Rate : 0.3353         
##    Detection Prevalence : 0.3782         
##       Balanced Accuracy : 0.9090         
##                                          
##        'Positive' Class : 0              
## 
metrics.RF <- postResample(pred = testing$predictions.RF, obs = testing$brand)
metrics.RF
##  Accuracy     Kappa 
## 0.9143657 0.8179179
fourfoldplot(confMatrixRF$table, conf.level = 0, margin = 1, main = "Confusion Matrix RF")

Random Forest gives an Accuracy of 91% and a kappa of 0.82.

4.4 Modelling Gradient Boosted Machine

4.4.1 Training GBM

4.4.1.1 GBM on Full Feature Set

##GBM
set.seed(111)
tGridGBM <- expand.grid(n.trees = c(150,200,250), 
                        interaction.depth = c(3,4), shrinkage = 0.1, 
                        n.minobsinnode = c(5,10))

model.GBM.brand <- train(brand ~ ., data = training, 
                         method = "gbm", trControl = fitControl,
                         verbose = FALSE, tuneGrid = tGridGBM)
model.GBM.brand
## Stochastic Gradient Boosting 
## 
## 7501 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6751, 6751, 6751, 6750, 6751, 6752, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.minobsinnode  n.trees  Accuracy   Kappa    
##   3                   5              150      0.9200129  0.8314645
##   3                   5              200      0.9241448  0.8397549
##   3                   5              250      0.9265432  0.8446125
##   3                  10              150      0.9213447  0.8339949
##   3                  10              200      0.9246782  0.8407339
##   3                  10              250      0.9252119  0.8415678
##   4                   5              150      0.9246775  0.8409505
##   4                   5              200      0.9260101  0.8435279
##   4                   5              250      0.9256108  0.8426225
##   4                  10              150      0.9254759  0.8425262
##   4                  10              200      0.9261420  0.8437455
##   4                  10              250      0.9244087  0.8399692
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 250,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 5.

4.4.1.2 GBM on Concentrated Feature Set

#concentrated featureset
model.GBM.brand.conc <- train(brand ~ ., data = training.conc, 
                             method = "gbm", trControl = fitControl,
                             verbose = FALSE, tuneGrid = tGridGBM)

model.GBM.brand.conc
## Stochastic Gradient Boosting 
## 
## 7501 samples
##    2 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6751, 6752, 6751, 6751, 6750, 6750, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.minobsinnode  n.trees  Accuracy   Kappa    
##   3                   5              150      0.9244049  0.8398714
##   3                   5              200      0.9250712  0.8412108
##   3                   5              250      0.9237383  0.8384823
##   3                  10              150      0.9264051  0.8445855
##   3                  10              200      0.9260042  0.8434867
##   3                  10              250      0.9240055  0.8390402
##   4                   5              150      0.9241388  0.8393019
##   4                   5              200      0.9222734  0.8352171
##   4                   5              250      0.9226727  0.8360535
##   4                  10              150      0.9242723  0.8396606
##   4                  10              200      0.9250730  0.8412356
##   4                  10              250      0.9240071  0.8390608
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

The best result for the tuned GBM model on the training set was Accuracy of 93% and 0.84 for Kappa.

4.4.2 Testing GBM

GBM with the full feature set, tuned as follows, was used: n.trees = 250, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 5.

## Test GBM
predictions.GBM <- predict(model.GBM.brand, testing)
testing$predictions.GBM <- predictions.GBM

#Confusion matrix
confMatrixGBM <- confusionMatrix(testing$brand, testing$predictions.GBM)
confMatrixGBM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  851   94
##          1  114 1440
##                                           
##                Accuracy : 0.9168          
##                  95% CI : (0.9052, 0.9273)
##     No Information Rate : 0.6138          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8238          
##  Mcnemar's Test P-Value : 0.1877          
##                                           
##             Sensitivity : 0.8819          
##             Specificity : 0.9387          
##          Pos Pred Value : 0.9005          
##          Neg Pred Value : 0.9266          
##              Prevalence : 0.3862          
##          Detection Rate : 0.3405          
##    Detection Prevalence : 0.3782          
##       Balanced Accuracy : 0.9103          
##                                           
##        'Positive' Class : 0               
## 
metrics.GBM <- postResample(pred = testing$predictions.GBM, obs = testing$brand)
metrics.GBM
##  Accuracy     Kappa 
## 0.9167667 0.8237539
fourfoldplot(confMatrixGBM$table, conf.level = 0, margin = 2, main = "Confusion Matrix GBM")

For the GBM applied to the test set, the Accuracy was 92% with a Kappa of 0.82.

4.5 Modelling Conclusions

##Turn off parallel processing
stopCluster(cluster)
registerDoSEQ()

4.5.1 Model Comparisons

##Convert predictions to numerics to allow calculations
testing$brand <- testing$brand %>% as.numeric()
testing$brand <- testing$brand -1

testing$predictions.KNN <- testing$predictions.KNN %>% as.numeric()
testing$predictions.KNN <- testing$predictions.KNN - 1

testing$predictions.RF <- testing$predictions.RF %>% as.numeric()
testing$predictions.RF <- testing$predictions.RF - 1

testing$predictions.GBM <- testing$predictions.GBM %>% as.numeric()
testing$predictions.GBM <- testing$predictions.GBM - 1

#AUC

#AUC KNN
pred.KNN <- prediction(testing$predictions.KNN,testing$brand)
auc.perf.KNN <- performance(pred.KNN, measure = "auc")

#AUC RF
pred.RF <- prediction(testing$predictions.RF,testing$brand)
auc.perf.RF <- performance(pred.RF, measure = "auc")

#ROC GBM
pred.GBM <- prediction(testing$predictions.GBM,testing$brand)
auc.perf.GBM <- performance(pred.GBM, measure = "auc")

#create results table
mNam <- c("KNN", "RF", "GBM")
Acc <- c(metrics.KNN[1], metrics.RF[1], metrics.GBM[1])
Kap <- c(metrics.KNN[2], metrics.RF[2], metrics.GBM[2])
Sens <- c(confMatrixKNN$byClass[1], confMatrixRF$byClass[1], confMatrixGBM$byClass[1])
Spec <- c(confMatrixKNN$byClass[2], confMatrixRF$byClass[2], confMatrixGBM$byClass[2])
auc <- c(auc.perf.KNN@y.values[[1]], auc.perf.RF@y.values[[1]], auc.perf.GBM@y.values[[1]])
modelResults <- data.frame(mNam,Acc,Kap,Sens,Spec,auc)
colnames(modelResults) <- c("Model", "Accuracy", "Kappa", "Sensitivity", "Specificity","AUC")


kable(modelResults) %>%
        kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Model Accuracy Kappa Sensitivity Specificity AUC
KNN 0.9131653 0.8155566 0.8831579 0.9315688 0.9082011
RF 0.9143657 0.8179179 0.8867725 0.9311454 0.9089590
GBM 0.9167667 0.8237539 0.8818653 0.9387223 0.9135850

All models performed well once correctly selected and tuned, managing over 90% Accuracy in all cases.

The KNN model which performed the best was based on a concentrated feature set and had the features centred and scaled. The Random Forest model which performed best was that which used the full feature set.

The best performing model, however, was the Gradient Boosting Machine deployed on the full feature set. It had not only the best accuracy, at 92%, but also the best metrics in almost every category.

4.5.2 Final Model Insights

##Explore Model Insights
summary(model.GBM.brand) %>% head(5)

##               var     rel.inf
## salary     salary 50.37718256
## age           age 46.87443845
## credit     credit  1.66492705
## elevel4   elevel4  0.11163407
## zipcode4 zipcode4  0.06788312

We see that of the features taken into the model, only two play a significant role, with salary having a slightly bigger effect than age. These two features account for 97% of the relative influence of all the features.
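The 97% figure follows directly from the relative-influence values printed above (copied here for the check):

```r
# Relative influence (%) as reported by summary(model.GBM.brand)
relinf <- c(salary = 50.37718, age = 46.87444, credit = 1.66493,
            elevel4 = 0.11163, zipcode4 = 0.06788)

sum(relinf[c("salary", "age")])  # salary and age together: ~97.25%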

testing$predictions.GBM <- testing$predictions.GBM %>% as.factor()
g13 <- ggplot(testing, aes(salary, age, colour = predictions.GBM)) +
        geom_point(show.legend = FALSE) +
        facet_grid(predictions.GBM ~ .) +
        theme_bw() +
        scale_color_brewer(palette="Dark2") +
        xlab("Salary") + 
        ylab("Age") + 
        ggtitle("Salary v Age by GBM prediction")
g13

g12 

Plotting the model's predictions by Salary v Age, coloured and faceted by predicted brand preference, we see that the model has done a good job of replicating the pattern that we found during our pre-modelling exploratory data analysis.

I feel confident that we can now define reasonably well three blocks of customers who prefer brand 0 (Acer), with all others preferring brand 1 (Sony).

The model captures the delineation more precisely than these simple blocks, and there is some bleeding around the margins, but in essence the following customers prefer Acer:

  • Salary 25k - 75k, Age 60 - 80
  • Salary 75k - 125k, Age 40 - 60
  • Salary 50k - 100k, Age 20 - 40

4.5.3 Predicting Brand Preference

#Update data types
incomplete$elevel <- incomplete$elevel %>% as.factor()
incomplete$car <- incomplete$car %>% as.factor()
incomplete$zipcode <- incomplete$zipcode %>% as.factor()
incomplete$brand <- incomplete$brand %>% as.factor()

#make predictions
predictions.incomplete <- predict(model.GBM.brand, incomplete)
incomplete$brand <- predictions.incomplete

4.5.4 Final Combined Data

#Combine data
totalData <- bind_rows(data, incomplete)
totalData$brand <- totalData$brand %>% factor(labels=c('Acer','Sony'))
saveRDS(totalData, file = "totalData.RDS")

The original data and the predicted data from the incomplete responses have now been combined and are available for the sales team to review.

5 Conclusions

table(totalData$brand)
## 
## Acer Sony 
## 5716 9284
g15 <- ggplot(totalData, aes(brand, fill = brand)) + 
        geom_bar() +
        theme_bw() +
        scale_fill_brewer(palette="Dark2") +
        xlab("Brand Preference") + 
        ylab("Frequency") + 
        ggtitle("Total Brand Preference Frequencies")
g15

We see that, based purely on numbers, Acer is roughly 60% as popular as Sony. Taking the decision purely on that basis, any strategic partnership that is pursued should be with Sony.

I would strike a note of caution, however. Whilst Sony is the more popular (62% of customers) and Acer the less (38% of customers), the customers preferring Acer still comprise a substantial minority. These customers may be alienated should Acer products become less available.
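These percentages can be checked directly from the combined counts in the table above (5,716 Acer, 9,284 Sony):

```r
counts <- c(Acer = 5716, Sony = 9284)

round(prop.table(counts), 3)               # Acer ~0.381, Sony ~0.619
round(counts[["Acer"]] / counts[["Sony"]], 2)  # Acer is ~0.62 times as popular as Sony
```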

5.1 Recommendations

I would recommend that we further explore the potential effects of pursuing a strategic partnership with Sony on the behaviours of the current customer base who prefer Acer. This could be done through further research into the reasons for preferring one brand over another, to create profiles of “Sony” and “Acer” customers beyond brute demographics. Perhaps there are Sony products that would serve Acer customers if we understood more about the reasons for their preferences.